An Improved Approach to perform Crawling and avoid Duplicate Web Pages
Authors
Abstract
When a web search is performed, the results often include many duplicate web pages or websites; that is, similar pages may be retrieved from a number of different web servers. We propose a web crawling approach to detect and avoid duplicate or near-duplicate web pages. In this work we present a keyword-prioritization-based approach to identify such pages on the web. Identifying these pages will optimize the web search.
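The abstract does not specify how keyword prioritization is actually computed, so the following Python sketch only illustrates the general idea: rank each page's most frequent keywords into a compact signature, then flag two pages as near-duplicates when their signatures largely overlap. The stopword list, the `top_k` cutoff, and the 0.8 Jaccard threshold are illustrative assumptions, not values taken from the paper.

```python
import re
from collections import Counter

# Minimal illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for"}

def keyword_signature(text, top_k=10):
    """Extract the top-k most frequent non-stopword terms as a page signature."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return frozenset(w for w, _ in counts.most_common(top_k))

def is_near_duplicate(sig_a, sig_b, threshold=0.8):
    """Flag two pages as near-duplicates when the Jaccard similarity of
    their keyword signatures reaches the threshold."""
    if not sig_a or not sig_b:
        return False
    jaccard = len(sig_a & sig_b) / len(sig_a | sig_b)
    return jaccard >= threshold
```

A crawler could compute the signature once per fetched page and skip indexing any page whose signature matches one already stored.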
Similar resources
Prioritize the ordering of URL queue in Focused crawler
The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download domain-specific web pages, and an unfocused approach often yields undesired results. Therefore, several new ideas have been proposed; among them a key technique is focused crawling, which is able to crawl particular topical...
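The abstract above concerns ordering the URL frontier of a focused crawler. A common way to realize this is a priority queue keyed by a topical relevance score, so the most promising links are fetched first. The sketch below is a generic illustration, not the scheme of the cited paper; the toy `relevance` function (fraction of topic terms appearing in the anchor text) is an assumption.

```python
import heapq

class URLQueue:
    """Priority queue ordering frontier URLs by relevance, highest first."""
    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker: preserves insertion order for equal scores

    def push(self, url, relevance):
        # heapq is a min-heap, so negate the score to pop the best URL first.
        heapq.heappush(self._heap, (-relevance, self._counter, url))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

def relevance(anchor_text, topic_terms):
    """Toy relevance score: fraction of topic terms found in the anchor text."""
    words = set(anchor_text.lower().split())
    return len(words & topic_terms) / len(topic_terms)
```

A real focused crawler would replace the toy score with a classifier or cosine similarity over page content, but the queue mechanics stay the same.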
An Efficient Approach for Near-duplicate page detection in web crawling
The drastic development of the World Wide Web in recent times has given the concept of web crawling remarkable significance. The voluminous amounts of web documents swarming the web have posed huge challenges to web search engines, making their results less relevant to users. The abundance of duplicate and near-duplicate web documents has created additional overhea...
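The abstract above does not name its detection method, but a widely used fingerprinting technique for near-duplicate pages in crawling is SimHash: near-identical documents produce fingerprints that differ in only a few bit positions, so comparison reduces to a Hamming-distance check. The sketch below is a generic illustration under that assumption; the use of MD5 as the per-token hash is an arbitrary choice for self-containment.

```python
import hashlib

def simhash(tokens, bits=64):
    """Compute a SimHash fingerprint over a bag of tokens."""
    votes = [0] * bits
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        for i in range(bits):
            # Each token votes +1/-1 on every bit position of the fingerprint.
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i in range(bits):
        if votes[i] > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

A crawler would treat two pages as near-duplicates when the Hamming distance of their fingerprints falls below a small threshold (often 3 for 64-bit fingerprints).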
A Survey of Duplicate And Near Duplicate Techniques
The World Wide Web consists of more than 50 billion pages online. The advent of the World Wide Web caused a dramatic increase in the usage of the Internet. The World Wide Web is a broadcast medium where a wide range of information can be obtained at a low cost. A great deal of the Web is replicated or near-replicated content. Documents may be served in different formats: HTML, PDF, and text for diff...
The improved Shark Search Approach for Crawling Large-scale Web Data
Web crawling is an important approach for collecting large-scale web data and keeping up with the rapidly expanding Internet. This paper puts forward an improved shark-search approach for crawling large-scale web data based on link clustering and tunneling technology. In this study we focus on the classification of web links, instead of downloaded web pages, to determine relevancy whic...
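The core of shark search is score inheritance along links: a child URL receives a decayed share of its parent's relevance, blended with the relevance of its own anchor text. The sketch below is a heavily simplified version of that idea, not the cited paper's improved algorithm; the `decay` and `mix` parameters are illustrative assumptions.

```python
def shark_child_score(parent_inherited, parent_relevance, anchor_relevance,
                      decay=0.8, mix=0.5):
    """Simplified shark-search scoring for a child URL.

    If the parent page itself is relevant, the child inherits a decayed
    share of that relevance; otherwise it inherits a decayed share of
    the parent's own inherited score (the 'tunneling' behavior). The
    result is blended with the relevance of the link's anchor text.
    """
    if parent_relevance > 0:
        inherited = decay * parent_relevance
    else:
        inherited = decay * parent_inherited
    return mix * inherited + (1 - mix) * anchor_relevance
```

Feeding these scores into a priority queue over the frontier yields the depth-decaying exploration behavior shark search is known for.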
Performance and Comparative Analysis of the Two Contrary Approaches for Detecting Near Duplicate Web Documents in Web Crawling
Recent years have witnessed the drastic development of the World Wide Web (WWW). Information is accessible at one's fingertips anytime, anywhere through the massive web repository. The performance and reliability of web search engines thus face huge problems due to the presence of an enormous amount of web data. The voluminous amount of web documents has resulted in problems for search engines leading to ...
Journal title:
Volume, Issue
Pages: -
Publication date: 2012